(54) Method and apparatus for managing membership of a group
of processors in a distributed computing environment

(57) Membership of a group of processors in a distributed
computing environment is managed. Specific actions are
identified and performed in order to manage the group
membership. A processor requests to join the group of
processors and thus is added to the group. Similarly,
processors may request to leave the group or may fail and
then are removed from the group. The group of members also
receives multicasts initiated from one member of the group
to other members of the group. Additionally, each group of
processors within a distributed computing environment has a
group leader that controls the actions being performed for
the group of members.

Description
TECHNICAL FIELD

This invention relates in general to distributed computing
environments and, in particular, to managing membership
within a group of processors of a distributed computing
environment.

BACKGROUND ART

In typical computing systems, there is a predefined
configuration in which a number of processors are defined.
These processors may be active or inactive. Active
processors receive applications to process and execute the
applications in accordance with the system configuration.
A predefined configuration is very inflexible, as the
processors are defined into the configuration in advance
and must stay in that configuration.

DISCLOSURE OF THE INVENTION

According to one aspect, the present invention provides a
method for joining a group of processors in a distributed
computing environment, said method comprising: requesting,
by a processor, to join a group of processors, said group of
processors executing related processes; and adding said
processor to said group of processors.

According to a preferred embodiment, the invention further
comprises informing said group of processors of said join,
and further comprises sending a message from one processor
of said group of processors to any other processors of said
group of processors. Preferably, the invention further
includes removing a processor from said group of processors,
further preferably wherein said removing comprises deleting
said processor from a membership list of said group of
processors, and/or further comprising informing said group
of processors of the removal of said processor, and/or
removing said processor from said group of processors when
said processor fails and/or when said processor requests to
leave said group of processors.

According to a preferred embodiment, said requesting
comprises providing a request to join said group of
processors to a group leader processor of said group of
processors, and/or further wherein said adding further
comprises informing, by said group leader processor, any
other processors of said group of processors of said join,
and/or wherein said providing comprises locating said group
leader processor via a name server, said name server
comprising a processor of said distributed computing
environment. Preferably, said name server is a member of
said group of processors. In another embodiment, said name
server is independent of said group of processors. According
to another embodiment, said adding comprises updating a
membership list of said group of processors; preferably,
said membership list is located at each processor of said
group of processors, and said updating comprises informing
each processor of said group of processors of said join and
updating said membership list by each processor.

According to a second aspect, the invention provides a
method for maintaining groups of processors in a distributed
computing environment, said method comprising: identifying a
specified action to be taken for a group of processors of
said distributed computing environment, said group of
processors including one or more member processors, each of
said one or more member processors including a related
process; and performing said specified action for said group
of processors.

Preferably, said specified action comprises one of the
following: (a) insert, wherein a processor is requesting to
join said group of processors; (b) multicast, wherein one of
said one or more member processors is requesting to forward
a message to any other member processors of said group of
processors; (c) leave, wherein one of said one or more
member processors is requesting to leave said group of
processors; (d) remove, wherein one member of said one or
more member processors is removed from said group of
processors when said one member fails; and (e) maintaining
a group leader for said group of processors.

According to third and fourth aspects, the invention also
provides systems for carrying out the methods referred to
above, with respective means for carrying out the
corresponding method steps.

According to fifth and sixth aspects, the invention also
provides computer program products stored on computer
readable storage media for instructing computer systems to
carry out the methods referred to above, such products
having means for carrying out the corresponding method
steps.

The group membership management technique of the present
invention advantageously enables actions to be performed on
a group basis. Each group includes processors that are each
executing, for instance, a single Group Services daemon that
takes part in implementing the group actions. The groups of
processors are collectively referred to as a metagroup
layer, which provides a simple mechanism for performing
group actions.

Accordingly, the present invention allows a processor to
become a member of a group of processors in which the group
of processors execute related processes. That is, the
invention enables actions to be performed on a group basis,
allowing processors to request to become a member of a
group. The invention also allows a member of a processor
group to leave the group or be removed from the group.
Further, the invention enables messages to be multicast to
group members. Thus, the present invention makes the overall
computing system very flexible in terms of processor
configurations.

According to a seventh aspect, the present invention
provides a method for recovering from a failed group leader
of a group of processors of a distributed computing
environment, said method comprising steps of: obtaining,
from a membership list ordered in sequence of joins of
processors to said group of processors, a next processor in
said membership list; and selecting said next processor as a
new group leader of said group of processors.

Preferably, said obtaining step comprises obtaining a next
active processor from said membership list. Preferably, said
method includes a further step of informing said group of
processors of said new group leader. Alternatively, the
method involves requesting appointment of said new group
leader from a name server, said name server selecting said
new group leader from said membership list. Preferably, said
membership list is located at each processor of said group
of processors, and said obtaining step comprises obtaining,
by a processor of said group of processors, said new group
leader from said membership list at said processor, and
further comprising informing a name server of said new group
leader. Preferably, the method further comprises informing,
by said name server, said group of processors of said new
group leader.

Preferably, the method further comprises receiving, by said
new group leader, prior to said new group leader being
selected as said new group leader, any messages previously
sent to said group of processors, and providing, by said new
group leader, any messages missed by any processor of said
group of processors. Preferably, the method further
comprises sending requests to said new group leader.

According to an eighth aspect, the present invention
provides a system for recovering from a failed group leader
of a group of processors of a distributed computing
environment, said system comprising a membership list
ordered in sequence of joins of processors to said group of
processors, and means for selecting a next processor from
said membership list as a new group leader of said group of
processors.

Preferably, said means for selecting comprises means for
selecting a next active processor from said membership list.
Preferably, the system further comprises means for informing
said group of processors of said new group leader.
Preferably, the system further comprises a name server
programmable to select said new group leader from said
membership list. Preferably, said membership list is located
at each processor of said group of processors, and said
means for selecting comprises means for selecting, by a
processor of said group of processors, said new group leader
from said membership list at said processor, and further
comprising means for informing a name server of said new
group leader, and further comprising means for informing, by
said name server, said group of processors of said new group
leader. Preferably, said system further comprises means for
receiving, by said new group leader, prior to said new group
leader being selected as said new group leader, any messages
previously sent to said group of processors, and means for
providing, by said new group leader, any messages missed by
any processor of said group of processors. Preferably, said
system further comprises means for sending requests to said
new group leader.

According to other aspects, the invention also provides
computer program products stored on computer-readable
storage media for instructing computer systems to carry out
the method referred to above, such products having program
means for carrying out each respective method step.

The group leader recovery mechanism of the present invention
provides a flexible technique for determining a new group
leader when the current group leader fails. It ensures that
the members of the group are aware of the new group leader
and can count on the group leader to control and manage the
group.

BRIEF DESCRIPTION OF THE DRAWINGS 

The subject matter which is regarded as the invention is
particularly pointed out and distinctly claimed in the
claims at the conclusion of the specification. The foregoing
and other objects, features and advantages of the invention
will be apparent from the following detailed description
taken in conjunction with the accompanying drawings, in
which:

FIG. 1 depicts one example of a distributed computing
environment incorporating the principles of the present
invention;

FIG. 2 depicts one example of an expanded view of a number
of the processing nodes of the distributed computing
environment of FIG. 1, in accordance with the principles of
the present invention;

FIG. 3 depicts one example of the components of a Group
Services facility, in accordance with the principles of the
present invention;

FIG. 4 illustrates one example of a processor group, in
accordance with the principles of the present invention;

FIG. 5a depicts one example of the logic associated with
recovering from a failed group leader of the processor group
of FIG. 4, in accordance with the principles of the present
invention;

FIG. 5b depicts another example of the logic associated with
recovering from a failed group leader of the processor group
of FIG. 4, in accordance with the principles of the present
invention;

FIG. 6a illustrates one example of a group leader, in
accordance with the principles of the present invention;

FIG. 6b illustrates a technique for selecting a new group
leader when the current group leader fails, in accordance
with the principles of the present invention;

FIG. 7 depicts one example of a name server receiving
information from a group leader, in accordance with the
principles of the present invention;

FIG. 8 depicts one example of the logic associated with
adding a processor to a group of processors, in accordance
with the principles of the present invention;

FIG. 9 depicts one example of the logic associated with a
processor leaving a group of processors, in accordance with
the principles of the present invention;

FIG. 10 illustrates one embodiment of a process group, in
accordance with the principles of the present invention;

FIG. 11 depicts one example of the logic associated with
proposing a protocol for a process group, in accordance with
the principles of the present invention;

FIG. 12 depicts one example of the logic associated with a
process requesting to join a process group, in accordance
with the principles of the present invention; and

FIG. 13 depicts one example of the logic associated with a
member of a process group requesting to leave the group, in
accordance with the principles of the present invention.

DETAILED DESCRIPTION OF THE PREFERRED EMBODIMENTS



In one embodiment, the techniques of the present invention
are used in distributed computing environments in order to
provide multicomputer applications that are highly-available.
Applications that are highly-available are able to continue
to execute after a failure. That is, the application is
fault-tolerant and the integrity of customer data is
preserved.

It is important in highly-available systems to be able to
coordinate, manage and monitor changes to subsystems (e.g.,
process groups) running on processing nodes within the
distributed computing environment. In accordance with the
principles of the present invention, a facility is provided
that implements the above functions. One example of such a
facility is referred to herein as Group Services.

Group Services is a system-wide, fault-tolerant and
highly-available service that provides a facility for
coordinating, managing and monitoring changes to a subsystem
running on one or more processors of a distributed computing
environment. Group Services, through the techniques of the
present invention, provides an integrated framework for
designing and implementing fault-tolerant subsystems and for
providing consistent recovery of multiple subsystems. Group
Services offers a simple programming model based on a small
number of core concepts. These concepts include, in
accordance with the principles of the present invention, a
cluster-wide process group membership and synchronization
service that maintains application specific information with
each process group.

As described above, in one example, the mechanisms of the
present invention are included in a Group Services facility.
However, the mechanisms of the present invention can be used
in or with various other facilities, and thus, Group
Services is only one example. The use of the term Group
Services to include the techniques of the present invention
is for convenience only.

In one embodiment, the mechanisms of the present invention
are incorporated and used in a distributed computing
environment, such as the one depicted in FIG. 1. In one
example, distributed computing environment 100 includes, for
instance, a plurality of frames 102 coupled to one another
via a plurality of LAN gates 104. Frames 102 and LAN gates
104 are described in detail below.

In one example, distributed computing environment 100
includes eight (8) frames, each of which includes a
plurality of processing nodes 106. In one instance, each
frame includes sixteen (16) processing nodes (a.k.a.
processors). Each processing node is, for instance, a
RISC/6000 computer running AIX, a UNIX based operating
system. Each processing node within a frame is coupled to
the other processing nodes of the frame via, for example, an
internal LAN connection. Additionally, each frame is coupled
to the other frames via LAN gates 104.

As examples, each LAN gate 104 includes either a RISC/6000
computer, any computer network connection to the LAN, or a
network router. However, these are only examples. It will be
apparent to those skilled in the relevant art that there are
other types of LAN gates, and that other mechanisms can also
be used to couple the frames to one another.
In addition to the above, the distributed computing
environment of FIG. 1 is only one example. It is possible to
have more or less than eight frames, or more or less than
sixteen nodes per frame. Further, the processing nodes do
not have to be RISC/6000 computers running AIX. Some or all
of the processing nodes can include different types of
computers and/or different operating systems. All of these
variations are considered a part of the claimed invention.

In one embodiment, a Group Services subsystem incorporating
the mechanisms of the present invention is distributed
across a plurality of the processing nodes of distributed
computing environment 100. In particular, in one example, a
Group Services daemon 200 (FIG. 2) is located within one or
more of processing nodes 106. The Group Services daemons are
collectively referred to as Group Services.

Group Services facilitates, for instance, communication and
synchronization between multiple processes of a process
group, and can be used in a variety of situations,
including, for example, providing a distributed recovery
synchronization mechanism. A process 202 (FIG. 2) desirous
of using the facilities of Group Services is coupled to a
Group Services daemon 200. In particular, the process is
coupled to Group Services by linking at least a part of the
code associated with Group Services (e.g., the library code)
into its own code. In accordance with the principles of the
present invention, this linkage enables the process to use
the mechanisms of the present invention, as described in
detail below.

In one embodiment, a process uses the mechanisms of the
present invention via an application programming interface
204. In particular, the application programming interface
provides an interface for the process to use the mechanisms
of the present invention, which are included in Group
Services, as one example. In one embodiment, Group Services
200 includes an internal layer 302 (FIG. 3) and an external
layer 304, each of which is described in detail below.

In accordance with the principles of the present invention,
internal layer 302 provides a limited set of functions for
external layer 304. The limited set of functions of the
internal layer can be used to build a richer and broader set
of functions, which are implemented by the external layer
and exported to the processes via the application
programming interface. The internal layer of Group Services
(also referred to as a metagroup layer) is concerned with
the Group Services daemons, and not the processes (i.e., the
client processes) coupled to the daemons. That is, the
internal layer focuses its efforts on the processors, which
include the daemons. In one example, there is only one Group
Services daemon on a processing node; however, a subset or
all of the processing nodes within the distributed computing
environment can include Group Services daemons.

The internal layer of Group Services implements functions on
a per processor group basis. There may be a plurality of
processor groups in the network. Each processor group (also
referred to as a metagroup) includes one or more processors
having a Group Services daemon executing thereon. The
processors of a particular group are related in that they
are executing related processes. (In one example, processes
that are related provide a common function.) For example,
referring to FIG. 4, a Processor Group X (400) includes
Processing Node 1 and Processing Node 2, since each of these
nodes is executing a process X, but it does not include
Processing Node 3. Thus, Processing Nodes 1 and 2 are
members of Processor Group X. A processing node can be a
member of none or any number of processor groups, and
processor groups can have one or more members in common.

In order to become a member of a processor group, a
processor needs to request to be a member of that group. In
accordance with the principles of the present invention, a
processor requests to become a member of a particular
processor group (e.g., Processor Group X) when a process
related to that group (e.g., Process X) requests to join a
corresponding process group (e.g., Process Group X) and the
processor is not aware of that corresponding process group.
Since the Group Services daemon on the processor handling
the request to join a particular process group is not aware
of the process group, it knows that it is not a member of
the corresponding processor group. Thus, the processor asks
to become a member so that the process can become a member
of the process group. (One technique for becoming a member
of a processor group is described in detail further below.)
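
The rule above, in which a daemon that has never heard of a
process group concludes that its processor is not yet in the
matching processor group, can be sketched as follows. This
is a minimal illustration only; the class and attribute
names are hypothetical and do not come from the
specification.

```python
class GroupServicesDaemon:
    """Hypothetical per-node daemon; names are illustrative assumptions."""

    def __init__(self, node_id):
        self.node_id = node_id
        self.known_process_groups = set()
        # Processor groups this node has asked to be inserted into.
        self.insert_requests = []

    def handle_process_join(self, group_name):
        """Called when a local process asks to join process group `group_name`."""
        if group_name not in self.known_process_groups:
            # An unknown process group implies this node is not a member of
            # the corresponding processor group, so request insertion first.
            self.insert_requests.append(group_name)
            # Simplification: treat the insert as immediately successful.
            self.known_process_groups.add(group_name)
        return group_name in self.known_process_groups
```

A second process on the same node joining the same process
group would not trigger another insert request, since the
daemon is by then aware of the group.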

Internal layer 302 (FIG. 3) implements a number of functions
on a per processor group basis. These functions include, for
example, maintenance of group leaders, insert, multicast,
leave, and fail, each of which is described in detail below.

In accordance with the principles of the present invention,
a group leader is selected for each processor group of the
network. In one example, the group leader is the first
processor requesting to join a particular group. As
described herein, the group leader is responsible for
controlling activities associated with its processor
group(s). For example, if a processing node, Node 2 (FIG. 4),
is the first node to request to join Processor Group X, then
Processing Node 2 is the group leader and is responsible for
managing the activities of Processor Group X. It is possible
for Processing Node 2 to be the group leader of multiple
processor groups.

If the group leader is removed from the processor group for
any reason, including, for instance, the processor requests
to leave the group, the processor fails or the Group
Services daemon on the processor fails, then group leader
recovery takes place. In particular, a new group leader is
selected, STEP 500a "SELECT NEW GROUP LEADER" (FIG. 5a).

In one example, in order to select a new group leader, a
membership list for the processor group, which is ordered in
sequence of processors joining the group, is scanned, by one
or more processors of the group, for the next processor in
the list, STEP 502 "OBTAIN NEXT MEMBER IN MEMBERSHIP LIST."
Thereafter, a determination is made as to whether the
processor obtained from the list is active, INQUIRY 504 "IS
MEMBER ACTIVE?" In one embodiment, this is determined by
another subsystem distributed across the processing nodes of
the distributed computing environment. The subsystem sends a
signal to at least the nodes in the membership list, and if
there is no response from a particular node, it assumes the
node is inactive.

If the selected processor is not active, then the membership
list is scanned again until an active member is located.
When an active processor is obtained from the list, then
this processor is the new group leader for the processor
group, STEP 506 "SELECTED MEMBER IS NEW GROUP LEADER."

For example, assume that three processing nodes joined
Processor Group X in the following order:
Processor 2, Processor 1, and Processor 3.
Thus, Processor 2 is the initial group leader (see FIG. 6a).
At some time later, Processor 2 leaves Processor Group X,
and therefore, a new group leader is desired. According to
the membership list for Processor Group X, Processor 1 is
the next group leader. However, if Processor 1 is inactive,
then Processor 3 would be chosen to be the new group leader
(FIG. 6b).
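
The scan of the join-ordered membership list can be sketched
as below. This is an illustrative sketch, not the patented
implementation: the function name and the `is_active`
callback (standing in for the separate liveness subsystem)
are assumptions.

```python
def select_new_leader(membership, is_active, old_leader=None):
    """Return the first active processor in join order, or None.

    `membership` is ordered by the sequence in which processors joined.
    `is_active` is a caller-supplied liveness check (an assumption here).
    `old_leader` is skipped, since it has left or failed.
    """
    for processor in membership:
        if processor == old_leader:
            continue
        if is_active(processor):
            return processor
    return None


# Join order from the example in the text.
membership = ["Processor 2", "Processor 1", "Processor 3"]
```

With every remaining node active, the call selects
Processor 1; if the liveness check reports Processor 1 as
inactive, the scan continues and selects Processor 3.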

In accordance with the principles of the present invention,
in one example, the membership list is stored in memory of
each of the processing nodes of the processor group. Thus,
in the above example, Processor 1, Processor 2, and
Processor 3 would all contain a copy of the membership list.
In particular, each processor to join the group receives a
copy of the membership list from the current group leader.
In another example, each processor to join the group
receives the membership list from another member of the
group other than the current group leader.

Referring back to FIG. 5a, in one embodiment of the
invention, once the new group leader is selected, the new
group leader informs a name server that it is the new group
leader, STEP 505 "INFORM NAME SERVER." As one example, a
name server 700 (FIG. 7) is one of the processing nodes
within the distributed computing environment designated to
be the name server. The name server serves as a central
location for storing certain information, including, for
instance, a list of all of the processor groups of the
network and a list of the group leaders for all of the
processor groups. This information is stored in the memory
of the name server processing node. The name server can be a
processing node within the processor group or a processing
node independent of the processor group.
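
The information held by the name server can be pictured as
an in-memory mapping from processor group to group leader.
The sketch below is only an illustration under that
assumption; the class and method names are hypothetical.

```python
class NameServer:
    """Hypothetical in-memory registry of processor groups and their leaders."""

    def __init__(self):
        # Processor group name -> identity of its current group leader.
        self._leaders = {}

    def set_leader(self, group, leader):
        """Record (or replace) the group leader, e.g. after leader recovery."""
        self._leaders[group] = leader

    def leader_of(self, group):
        """Return the recorded leader, or None if the group is unknown."""
        return self._leaders.get(group)

    def groups(self):
        """Return the names of all known processor groups."""
        return sorted(self._leaders)
```

A new group leader would call `set_leader` to announce
itself, which corresponds to the "inform name server" step.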

In one example, name server 700 is informed of the group
leader change via a message sent from the Group Services
daemon of the new group leader to the name server.
Thereafter, the name server informs the other processors of
the group of the new group leader via, for example, an
atomic multicast, STEP 510 "INFORM OTHER MEMBERS OF THE
GROUP" (FIG. 5a). (Multicasting is similar in function to
broadcasting; however, in multicasting the message is
directed to a selected group, instead of being provided to
all processors of a system. In one example, multicasting can
be performed by providing software that takes the message
and the list of intended recipients and performs point to
point messaging to each intended recipient using, for
example, a User Datagram Protocol (UDP) or a Transmission
Control Protocol (TCP). In another embodiment, the message
and list of intended recipients are passed to the underlying
hardware communications, such as Ethernet, which will
provide the multicasting function.)
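
The software form of multicasting described above, one
point-to-point send per intended recipient, can be sketched
as follows. The transport is passed in as a `send` callable,
which is an assumption made to keep the sketch independent
of any particular protocol; in practice it might wrap a UDP
or TCP send.

```python
def multicast(message, recipients, send):
    """Deliver `message` to each recipient with one point-to-point send.

    `send(recipient, message)` is a caller-supplied transport function
    (hypothetical here). Returns the recipients that were sent to.
    """
    delivered = []
    for recipient in recipients:
        send(recipient, message)
        delivered.append(recipient)
    return delivered
```

Only the listed recipients receive the message, which is
what distinguishes this from a broadcast to every processor
of the system.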

In another embodiment of the invention, a member of the
group other than the new group leader informs the name
server of the identity of the new group leader. As a further
example, the processors of the group are not explicitly
informed of the new group leader, since each processor in
the processor group has the membership list and has
determined for itself the new group leader.

In yet another embodiment of the invention, when a new group
leader is needed, a request is sent to the name server
requesting from the name server the identity of the new
group leader, STEP 500b "REQUEST NEW GROUP LEADER FROM NAME
SERVER" (FIG. 5b). In this embodiment, the membership list
is also located at the name server, and the name server goes
through the same steps described above for determining the
new group leader, STEPS 502, 504 and 505. Once it is
determined, the name server informs the other processors of
the processor group of the new group leader, STEP 510
"INFORM OTHER MEMBERS OF THE GROUP."

In addition to the group leader maintenance function
implemented by the internal or metagroup layer, an insert
function is also implemented. The insert function is used
when a Group Services daemon (i.e., a processor executing
the Group Services daemon) wishes to join a particular group
of processors. As described above, a processor requests to
be added to a particular processor group when a process
executing on the processor wishes to join a process group
and the processor is unaware of the process group.

In one example, in order to become a member of a processor
group, the processor wishing to join the group first
determines who is the group leader of the processor group,
STEP 800 "DETERMINE GROUP LEADER" (FIG. 8). In one
embodiment, the group leader is determined by providing name
server 700 with the name of the processor group and
requesting from the name server the identity of the group
leader for that group.

Should the name server respond that the requesting processor
is the group leader (since this is the first request for the
group), INQUIRY 801, the requesting processor forms the
processor group, STEP 803 "FORM GROUP." In particular, it
creates a membership list for that particular processor
group, which includes the requesting processor.
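
The first part of the join logic, asking the name server for
the leader and forming the group when the requester turns
out to be the leader, can be sketched as below. The
dictionaries standing in for the name server's leader table
and for per-group membership lists, and the function names,
are assumptions for illustration.

```python
def determine_group_leader(leaders, group, requester):
    """Name-server lookup: the first requester for a group becomes its leader."""
    return leaders.setdefault(group, requester)


def join(leaders, memberships, group, requester):
    """Return "formed" if the requester formed the group, otherwise a
    tuple describing the insert request to send to the existing leader."""
    leader = determine_group_leader(leaders, group, requester)
    if leader == requester:
        # The requester is the leader: create the membership list.
        memberships[group] = [requester]
        return "formed"
    # Otherwise an insert request goes to the existing group leader.
    return ("send_insert_request", leader)
```

The sketch does not guard against a leader asking to join
its own group a second time; a fuller version would check
whether the membership list already exists.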

If the processor is not the group leader, then it sends an
insert request, via a message, to the group leader, the
identity of which is obtained from the name server, STEP 802
"SEND INSERT REQUEST TO GROUP LEADER." The group leader then
adds the requesting processor to the processor group, STEP
804 "GROUP






EP 0 805 393 A2 



[...] that processor. In particular, [...] the leader
informs the other daemons, via a multicast [...], the
daemons acknowledge the update, and then the leader sends
out a commit for the change via another multicast. (In
another embodiment, the informing can be performed via an
atomic multicast.) In one example, the joining processor is
added to the end of the membership list, since the list is
maintained by order of joins.
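
The inform, acknowledge, commit exchange for an insert can
be sketched as a simple two-step update over the members'
copies of the membership list. This is a hedged illustration
only: the `Member` class, its `prepare` and `commit`
methods, and `leader_insert` are hypothetical names, and
direct method calls stand in for the two multicasts.

```python
class Member:
    """Hypothetical group member holding its own copy of the membership list."""

    def __init__(self, node_id, membership):
        self.node_id = node_id
        self.membership = list(membership)
        self.pending = None

    def prepare(self, new_node):
        """First multicast: note the proposed insert and acknowledge it."""
        self.pending = new_node
        return True

    def commit(self):
        """Second multicast: append the new node at the end (join order)."""
        self.membership.append(self.pending)
        self.pending = None


def leader_insert(members, new_node):
    """Leader-side insert: commit only if every member acknowledged."""
    acks = [m.prepare(new_node) for m in members]
    if not all(acks):
        return False
    for m in members:
        m.commit()
    return True
```

After a successful insert, every copy of the list ends with
the newly joined processor, which preserves the join order
that group leader recovery relies on.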

In accordance with the principles of the present invention,
a processor that is a member of a processor group may
request to leave the group. Similar to the insert request, a
leave request is forwarded to the group leader via, for
instance, a message, STEP 900 "SEND LEAVE REQUEST TO GROUP
LEADER" (FIG. 9). Thereafter, the group leader removes the
processor from the group by, for example, deleting the
processor from its membership list and informing all members
of the processor group to also remove the processor from
their membership list, STEP 902 "GROUP LEADER REMOVES
PROCESSOR FROM GROUP." Additionally, if the leaving
processor is the group leader, then group leader recovery
takes place, as described above.
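
The leave handling, including the case where the leaving
processor is itself the group leader, can be sketched as
follows. The function name and the `is_active` callback are
assumptions; the sketch operates on a single copy of the
list, whereas the text has every member update its own copy.

```python
def handle_leave(membership, leader, leaving, is_active=lambda p: True):
    """Return (new_membership, new_leader) after `leaving` is removed.

    If the leaving processor was the leader, the next active processor
    in join order takes over (None if no active member remains).
    """
    remaining = [p for p in membership if p != leaving]
    if leaving != leader:
        return remaining, leader
    new_leader = next((p for p in remaining if is_active(p)), None)
    return remaining, new_leader
```

The same function also covers removal of a failed processor,
since the group leader deletes it from the list in the same
way.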

In addition to the above, if a processor fails, or if the
Group Services daemon executing on the processor fails, the
processor is removed from the processor group. In one
embodiment, when the Group Services daemon fails, it is
assumed that the processor fails. In one example, a failed
processor is detected by a subsystem running within the
distributed computing environment that detects processor
failure. When there is a failure, in one instance, the
processor is removed by the group leader. In particular, the
group leader deletes the processor from its membership list
and informs the other member processors to do the same, as
described above.

Another function implemented by the internal layer of Group
Services is a multicast function. In accordance with the
principles of the present invention, a member of a processor
group can multicast a message to the other members of the
group. This multicast can include one-way multicasts, as
well as acknowledged multicasts.

In one embodiment, in order to multicast a message from one
member of a group to other members of the group, the message
sending member sends the message to the group leader of the
group, and the group leader multicasts the message to the
other members.

In accordance with the principles of the present invention,
prior to sending a message, the group leader assigns a sequence
number to the message. The sequence numbers are kept in
numerical order. Thus, if a member of the processor group
receives a message having a sequence number out of order, it
knows that it has missed a message. For instance, if a
processing node receives two messages whose sequence numbers are
not consecutive, it knows it missed the message in between.
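
The gap detection described above can be sketched as follows. This is only an illustrative sketch, not the implementation of Group Services: the class name SequenceTracker, its fields, and the choice to hold out-of-order messages in a dictionary are all assumptions made for the example.

```python
class SequenceTracker:
    """Hypothetical per-node tracker for one processor group's multicast stream."""

    def __init__(self, last_seen=0):
        self.last_seen = last_seen  # highest sequence number received in order
        self.pending = {}           # out-of-order messages held until the gap closes

    def receive(self, seq, payload):
        """Record a message and return the sequence numbers known to be missing."""
        if seq <= self.last_seen:
            return []  # duplicate of a message already seen in order
        self.pending[seq] = payload
        # Advance past every message that is now contiguous with last_seen.
        while self.last_seen + 1 in self.pending:
            self.last_seen += 1
            del self.pending[self.last_seen]
        if not self.pending:
            return []
        # Anything between last_seen and the highest held message is missing.
        return [n for n in range(self.last_seen + 1, max(self.pending))
                if n not in self.pending]


tracker = SequenceTracker(last_seen=42)
tracker.receive(43, "first")
missing = tracker.receive(45, "third")  # 44 was skipped
```

A node that gets a non-empty list back would then request those messages, for example from the group leader.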

In accordance with the principles of the present invention, the
processing node can retrieve the missing message from any of the
processing nodes in the processor group, since all of the nodes
in the group have all of the same messages. However, in one
example, the processing node missing the information requests it
from the group leader. If it is the group leader that is missing
the message, then it can request it from any of the other
processing nodes in the processor group. This is possible since
key data is replicated across all of the processing nodes of the
processor group, in a recoverable fashion. There is no need, in
accordance with the present invention, to store the data
required for recovery in persistent stable storage. The
technique of the present invention avoids the need for
persistent stable hardware-based storage for storing this data.

For example, if the group leader fails, a new group leader is
selected, as described above. The group leader ensures that it
has all of the messages by communicating with the processing
nodes of the group. In one embodiment, once the group leader is
sure that it has all of the messages, it ensures that all of the
other processing nodes of the group also have those messages.
The technique of the present invention thus allows recovery from
a failed processing node, failed processes or link without
requiring stable storage.

In accordance with the principles of the present invention, each
processor group maintains its own ordered set of messages. Thus,
the messages for one processor group will not overlap or
interfere with the messages of another processor group. The
processor groups, along with their ordered messages, are
independent of one another. Therefore, one processor group may
receive an ordered set of messages of 43, 44 and 45, while
another processor group may receive an independently ordered set
of messages of 1, 2 and 3. This avoids the need for all-to-all
communication among all of the processors of a network.

In one embodiment of the invention, each processing node retains
the messages it receives for a certain amount of time, in case
it needs to provide the message to another node or in case it
becomes the group leader. The messages are saved until the
messages are received by all of the processors of the group.
Once the messages are received by all of the processors, then
the messages can be discarded.

In one example, it is the group leader that informs the
processing nodes that the messages have been received by all of
the nodes. Specifically, in one example, when a processing node
sends a message to the group leader, it includes an indication
of the last message that it has seen (i.e., the last message in
proper order). The group leader collects this information, and
when it sends a message to the processing nodes, it includes in
the message the sequence number of the last message seen by all
of the nodes. Thereafter, the processing nodes can delete those
messages indicated as being seen.
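
The bookkeeping just described can be sketched as follows. The names LeaderAckCollector, MessageLog, stable_point and discard_through are hypothetical and chosen only for illustration; the sketch assumes that sequence numbers start at 1 and that a node which has reported nothing has seen message 0.

```python
class LeaderAckCollector:
    """Hypothetical leader-side record of the last in-order message each node has seen."""

    def __init__(self, members):
        self.last_seen = {member: 0 for member in members}

    def report(self, member, seq):
        # Each message sent to the leader carries the sender's last in-order sequence number.
        self.last_seen[member] = max(self.last_seen[member], seq)

    def stable_point(self):
        # The last message seen by ALL nodes is the minimum over the reports.
        return min(self.last_seen.values())


class MessageLog:
    """Hypothetical node-side store of retained messages."""

    def __init__(self):
        self.messages = {}

    def store(self, seq, payload):
        self.messages[seq] = payload

    def discard_through(self, stable_seq):
        # Delete every retained message that all nodes are known to have seen.
        for seq in [s for s in self.messages if s <= stable_seq]:
            del self.messages[seq]


collector = LeaderAckCollector(["node1", "node2", "node3"])
collector.report("node1", 5)
collector.report("node2", 3)
collector.report("node3", 4)

log = MessageLog()
for n in range(1, 6):
    log.store(n, f"message {n}")
log.discard_through(collector.stable_point())
```

Taking the minimum is what makes it safe to discard: a message is dropped only once no node could still need it for recovery.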

In accordance with the principles of the present invention, the
multicast stream is advantageously quiesced at certain times to
ensure all processor group members have received all of the
messages. For example, the stream is quiesced when there have
been no multicasts for a certain period of time or after some
number of NoAckRequired (i.e., no acknowledgment required)
multicasts have been sent. In one embodiment, when the multicast
stream is to be quiesced, the group leader sends out a SYNC
multicast, which all processor group members acknowledge. When a
processor group member receives such a message, it knows that it
has (or should have) all of the messages, based on the sequence
number of the SYNC message. If it is missing any messages, it
obtains the messages before acknowledging. When the group leader
receives all of the acknowledgments to this multicast, it knows
that all processor group members have received all of the
messages, and therefore the multicast stream is synced and
quiesced.
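
A SYNC round of this kind might be tracked as in the sketch below. The names SyncRound and missing_before_sync are hypothetical, and the sketch assumes that sequence numbers start at 1 and that the SYNC message itself carries the next number in the stream.

```python
def missing_before_sync(received, sync_seq):
    """Sequence numbers below the SYNC's number that a member still has to obtain."""
    return [n for n in range(1, sync_seq) if n not in received]


class SyncRound:
    """Hypothetical leader-side tracker for one SYNC multicast."""

    def __init__(self, sync_seq, members):
        self.sync_seq = sync_seq
        self.waiting = set(members)  # members that have not yet acknowledged

    def acknowledge(self, member):
        # A member acknowledges only after it has obtained any missing messages.
        self.waiting.discard(member)
        return self.is_quiesced()

    def is_quiesced(self):
        return not self.waiting


round_ = SyncRound(sync_seq=5, members=["node1", "node2"])
gaps = missing_before_sync({1, 2, 4}, round_.sync_seq)  # this member lacks message 3
round_.acknowledge("node1")
done = round_.acknowledge("node2")
```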

In another embodiment of the invention, a specific SYNC
multicast is not necessary. Instead, one of the following
techniques can be used to quiesce the multicast stream. As one
example, a multicast requiring an acknowledgment can be sent
from the group leader to the processors. When a processor
receives a multicast that requires an acknowledgment, it sends
the acknowledgment to the group leader. The acknowledgment
contains the sequence number of the multicast it is
acknowledging. The processors use this sequence number to
determine if they are missing any messages. If so, they request
the missing messages from the group leader, as one example.
After the group leader multicasts the AckRequired message to all
of the processors of the group and receives all of the
acknowledgments, the group leader knows that the stream is
quiesced. The non-group leader processors rely on the group
leader to ensure that they receive all the messages in a timely
fashion, so they do not need to periodically acknowledge or ping
the group leader to ensure they have not missed a multicast.

As a further example, in those situations in which NoAckRequired
multicasts are being used, the group leader can alter one of the
NoAckRequired multicasts into an AckRequired multicast, thus
using it as a sync in the manner described above. Thus, no
explicit SYNC message is required.

In addition to the above, in another example, it is possible for
the non-group leader processors to anticipate the group leader's
action, such that if the number of NoAckRequired messages
approaches the window size (e.g., reaches a predetermined
number, such as five in one example) or if a maximum idle time
approaches, the non-group leader processors can send an ACK to
the group leader. The ACK provides to the group leader the
highest sequence number multicast that each processor has
received. If all of the non-group leader processors do this,
then it is not necessary for the group leader to turn a
NoAckRequired multicast into an AckRequired multicast.
Therefore, the group is not held up by waiting for all of the
acknowledgments.

Support for the above feature of the present invention is
transparent to the users of Group Services (i.e., the
processes). No explicit actions are necessary by the processes
to implement this feature. Additionally, this support is
available in the internal and external layers of Group Services.

Referring back to FIG. 3, external layer 304 implements a richer
set of mechanisms of the application programming interface that
is easy for the user (i.e., the client processes) to understand.

In one example, these mechanisms include an atomic multicast, a
2-phase commit, barrier synchronization, process group
membership, processor group membership, and process group state
value, each of which is described below. These mechanisms, as
well as others, are unified, in accordance with the principles
of the present invention, by the application programming
interface into a single unified framework that is easy to
understand. In particular, communications and synchronization
mechanisms (in addition to other mechanisms) have been unified
into a single protocol.

In accordance with the principles of the present invention, the
single, unified framework is provided to members of process
groups, as described in detail herein. A process group includes
one or more related processes executing on one or more
processing nodes of the distributed computing environment. For
example, referring to FIG. 10, a Process Group X (1000) includes
a Process X executing on Processor 1 and two Process X's
executing on Processor 2. The manner in which a process becomes
a member of a particular process group is described in detail
further below.

Process groups can have at least two types of members, including
a provider and a subscriber. A provider is a member process that
has certain privileges, such as voting rights, and a subscriber
has no such privileges. A subscriber can merely watch the
ongoings of a process group, but cannot participate in the
group. For example, a subscriber can monitor the membership of a
group, as well as the state value of the group, but it cannot
vote. In other embodiments, other types of members with
differing rights can be provided.

In accordance with the principles of the present invention, the
application programming interface is implemented, as described
below with reference to FIG. 11.

Referring to FIG. 11, in one example, initially, a provider of a
process group proposes a protocol for the group (subscribers
cannot propose protocols, in this embodiment), STEP 1100 "MEMBER
OF PROCESS GROUP PROPOSES A PROTOCOL FOR THE GROUP." In
particular, in one instance, an API call is made proposing the
protocol. In one example, the protocol is submitted, by a
process, to the external layer of the Group Services daemon on
the processor executing the process. That Group Services daemon
then submits the protocol to the group leader of the group via a
message. The group leader then informs, via a multicast, all of
the processors of the related processor group of the protocol.
(The internal layer of the daemon is managing this multicast.)
Those processors then inform the appropriate members of the
process group, via the external layer, of the proposed protocol,
STEP 1102 "INFORM PROCESS GROUP MEMBERS OF THE PROTOCOL."

If multiple providers propose a protocol at the same time, then
the group leader selects the protocol to be run in the following
manner. In one embodiment, the protocols are prioritized in that
any protocol for a failure is first, a join protocol is second,
and all other protocols (e.g., requests to leave, expel, update
state value and provide a group message, described below) are on
a first come, first served basis. Thus, if a request to remove a
member due to a failure is proposed at the same time as a
request to join and a request to leave, then the request to
remove is selected first. Then, the request to join is selected,
followed by the request to leave.

If there are multiple requests to remove due to failure, then
all of these requests are selected prior to the request to join.
The requests to remove are selected by the group leader in the
order seen by the group leader (unless batching is allowed, as
described below). Similarly, if there are multiple requests to
join, then these are selected in a likewise manner prior to any
of the other requests.

In one embodiment, if there are multiple other requests, the
first one received by the group leader is selected and the
others are dropped. The group leader informs the providers of
those dropped requests that they have been dropped, and then
they can resubmit them if they wish. In another embodiment of
the invention, these other requests can be queued in order of
receipt and selected in turn, instead of being dropped.
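
The selection order described in the preceding paragraphs can be sketched as follows. The constants FAILURE, JOIN and OTHER and the function select_protocols are hypothetical names used only for this illustration; the sketch follows the embodiment in which all but the first of the other requests are dropped rather than queued.

```python
FAILURE, JOIN, OTHER = "failure", "join", "other"


def select_protocols(proposals):
    """Order simultaneous proposals the way the group leader would run them.

    proposals is a list of (kind, description) pairs in the order the
    group leader saw them. Returns (selected, dropped).
    """
    failures = [p for p in proposals if p[0] == FAILURE]  # always first, in order seen
    joins = [p for p in proposals if p[0] == JOIN]        # second, in order seen
    others = [p for p in proposals if p[0] == OTHER]      # first come, first served
    selected = failures + joins + others[:1]
    dropped = others[1:]  # their proposers are told and may resubmit
    return selected, dropped


selected, dropped = select_protocols([
    (OTHER, "leave A"),
    (JOIN, "join B"),
    (FAILURE, "remove C"),
    (OTHER, "expel D"),
])
```

In the alternative embodiment, the dropped list would instead be kept as a queue and served in order of receipt.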

After a protocol is selected, a determination is made as to
whether voting should be performed for the protocol, INQUIRY
1104 "VOTING?" In one example, the process proposing the
protocol indicates during the initial proposal whether voting is
to take place. If the proposal indicates no voting, then the
protocol is simply an atomic multicast, and the protocol is
complete, STEP 1106 "END."

If voting is to take place, then each provider of the process
group votes on the protocol, STEP 1108 "PROCESS GROUP MEMBERS
WITH VOTING PRIVILEGES VOTE." Specifically, in accordance with
the principles of the present invention, the voting allows each
provider to take local actions necessary to satisfy the group,
and to inform the group of the results of those actions. This
functions as a barrier synchronization primitive by ensuring
that all providers have reached a particular point before
proceeding.

In one embodiment of the present invention, each provider votes
by casting a vote value, which may include one of the following,
as an example:

(a) APPROVE, specifying that the provider wishes to complete the
protocol once all of the providers have reached this barrier,
and to accept all the proposed changes;

(b) CONTINUE, specifying that the provider wishes to continue
the protocol through another voting step, and proposed changes
remain pending; and

(c) REJECT, specifying that the provider wishes to end this
protocol once all the providers have reached this barrier, and
to reject those proposed changes that can be rejected.

In accordance with the principles of the present invention, each
provider of the process group forwards its vote to the Group
Services daemon executing on the same processor as the process.
The Group Services daemon then forwards the vote values it
receives to the group leader for the metagroup associated with
that process group. For instance, the vote values for Process
Group X are forwarded to the group leader of Processor Group X.
Based on the vote values, the group leader determines how the
protocol should proceed. The group leader then multicasts the
result of the voting to each of the processors of the
appropriate processor group (i.e., to the Group Services daemons
on those processors), and the Group Services daemons inform the
providers of the result value. For example, the group leader
informs the Group Services daemons of Processor Group X, and the
Group Services daemons provide the result to the providers of
Process Group X.

If one of the providers voted CONTINUE and none of the providers
voted REJECT, INQUIRY 1110 "CONTINUE VOTING?", then the protocol
proceeds to another voting step, STEP 1108. That is, the
providers are performing barrier synchronization with a dynamic
number of synchronization phases. In particular, in accordance
with the principles of the present invention, the number of
voting steps (or synchronization phases or points) that a
protocol can have is dynamic. It can be any number of steps
desired by the voting members. The protocol can continue as long
as any provider wishes for the protocol to continue. Thus, in
one embodiment, the voting dynamically controls the number of
voting steps. However, in another embodiment, the dynamic number
of voting steps can be set during the initiation of the
protocol. It is still dynamic, since it can change each time the
protocol is initialized.

If the providers vote not to continue to another voting step,
then the protocol is a 2-phase commit. After the voting is
complete (either for a two-phase or multi-phase vote), the
result of the vote is provided to the members. In particular,
should any one provider of the process group vote REJECT, then
the protocol ends and the proposed changes are rejected. Each of
the providers is informed, via a multicast, that the protocol
has been rejected, STEP 1112 "INFORM MEMBERS OF COMPLETION OF
PROTOCOL." On the other hand, if all of the providers voted
APPROVE, then the protocol is complete and all of the proposed
changes are accepted. The providers are informed of the approved
protocol, via a multicast, STEP 1112 "INFORM MEMBERS OF
COMPLETION OF PROTOCOL."
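
The outcome of one voting step, as described above, can be sketched as follows. The Vote enumeration and the function tally are hypothetical names for this illustration; the three vote values themselves are those named in the text.

```python
from enum import Enum


class Vote(Enum):
    APPROVE = "approve"
    CONTINUE = "continue"
    REJECT = "reject"


def tally(votes):
    """Return the result of one voting step, given every provider's vote."""
    votes = list(votes)
    if not votes:
        raise ValueError("at least one provider must vote")
    if Vote.REJECT in votes:
        return Vote.REJECT    # any REJECT ends the protocol
    if Vote.CONTINUE in votes:
        return Vote.CONTINUE  # no REJECT and some CONTINUE: vote again
    return Vote.APPROVE       # every provider approved


result = tally([Vote.APPROVE, Vote.CONTINUE, Vote.APPROVE])
```

A protocol that returns APPROVE or REJECT on its first step behaves as a 2-phase commit; each CONTINUE result adds one more synchronization phase.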

In accordance with the principles of the present invention, the
above-described protocol is also integrated with process group
membership and process group state values. In particular, the
mechanisms of the present invention are used to manage and
monitor membership changes to the process groups. Changes to
group membership are proposed via the protocol described above.
Additionally, the mechanisms of the present invention mediate
changes to the group state value and guarantee that it remains
consistent and reliable, as long as at least one process group
member remains.

A group state value for the process group acts as a synchronized
blackboard for the process group. In one embodiment, the group
state value is an application specific value controlled by the
providers. The group state value is part of the group state data
maintained for each process group by Group Services. In addition
to the group state value, the group state data includes a
provider membership list for that group. Each provider is
identified by a provider identifier, and the list is ordered by
Group Services such that the oldest provider (the first provider
joining the group) is at the head of the list, and the youngest
is at the end.

Changes to the group state value are proposed by group members
(i.e., the providers) via the protocol described above. In one
embodiment, the contents of the group state value are not
interpreted by Group Services. The meaning of the group state
value is attached by the group members. The mechanisms of the
present invention guarantee that all process group members see
the same sequence of changes to the group state values and that
all process group members will see the updates.

Thus, as described above, the application programming interface
of the present invention provides a single, unified protocol
that includes a plurality of mechanisms including, for example,
an atomic multicast, 2-phase commit, barrier synchronization,
group membership and group state value. The manner in which the
protocol is used for group membership and the group state value
is described in further detail below.

The voting mechanism described above is used, in accordance with
the principles of the present invention, to propose changes to
the membership of a process group. For instance, if a process
wishes to join a particular process group, such as Process Group
X, then that process issues a join call, STEP 1200 "INITIATE
REQUEST TO JOIN" (FIG. 12). In one embodiment, this call is sent
as a message across a local communications path (e.g., a UNIX
domain socket) to the Group Services daemon on the processor
executing the requesting process. The Group Services daemon
sends a message to the name server asking the name server for
the name of the group leader for the process group that the
requesting process wishes to join, STEP 1202 "DETERMINE GROUP
LEADER."

If this is the first request to join the particular process
group, then the name server informs the Group Services daemon
that it is the group leader, INQUIRY 1204 "FIRST REQUEST TO
JOIN?" Thus, the processor creates a processor group, as
described above, and adds the process to the process group, STEP
1210 "ADD PROCESS." In particular, the process is added to a
membership list for that process group. This membership list is
maintained by Group Services, for example, as an ordered list.
In one example, it is ordered in sequence of joins. The first
process to join is first in the list, and so forth.

In accordance with the principles of the present invention, the
first process to join a process group identifies a set of
attributes for the group. These attributes are included as
arguments in the join call sent by the process. These attributes
include, for instance, the group name, which is a unique
identifier, and prespecified information that defines to Group
Services how the group wishes to manage various protocols. For
instance, the attributes can include an indication of whether
the process group will accept batched requests, as described
below. Additionally, in another example, the attributes can
include a client version number representing, for example, the
software level of the programming in each provider. This will
ensure that all group members are at the same level. The
above-described attributes are only one example. Additional or
different attributes can be included without departing from the
spirit of the claimed invention.

Returning to INQUIRY 1204 "FIRST REQUEST TO JOIN?", if this is
not the first request to join, then the join request is sent via
a message to the group leader designated by the name server,
STEP 1214 "SEND JOIN REQUEST TO GROUP LEADER." The group leader
then performs a prescreening test, STEP 1216 "PRESCREEN." In
particular, the group leader determines whether the attributes
specified by the requesting process are the same as the
attributes set by the first process of the group. If not, then
the join request is rejected.
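
The prescreening test can be sketched as a simple equality check. The record GroupAttributes and its field names are hypothetical; they mirror the example attributes named above (group name, batching indication, client version number), and a real attribute set may contain more or different fields.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class GroupAttributes:
    """Hypothetical attribute set passed as arguments to a join call."""
    group_name: str
    accepts_batched_requests: bool
    client_version: int


def prescreen(established, requested):
    """Pass only if the joiner's attributes match those set by the first process."""
    return established == requested


first = GroupAttributes("Process Group X", True, 2)
same_level = GroupAttributes("Process Group X", True, 2)
older_level = GroupAttributes("Process Group X", True, 1)

accepted = prescreen(first, same_level)   # proceeds to the vote
rejected = prescreen(first, older_level)  # join request is rejected outright
```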

However, if the prescreen test is successful, then the providers
of the process group are informed of the request via, for
instance, a multicast from the group leader, and the providers
vote on whether to allow the process to be added to the group,
STEP 1220 "VOTE." The voting takes place as described above. The
providers can vote to continue the protocol and vote on this
join again, or they can vote to reject or approve the join. If
one of the providers votes REJECT, then the join is terminated
and the process is not added to the group, INQUIRY 1222
"SUCCESSFUL?" However, if all of the providers vote APPROVE,
then the process is added to the group, STEP 1224 "ADD PROCESS."
In particular, the process is added to the end of the membership
list for the group. Once the protocol is complete, the members
of the group are notified of the result. In particular, in one
example, all of the members (including the providers and
subscribers) are notified when the process is added, but only
the providers are notified when the protocol has been rejected.
In another example, other types of members may also be notified,
as deemed appropriate.

Join requests are used by providers to join a process group, as
described above. A provider is afforded certain benefits, such
as voting rights. Processes can also subscribe to a process
group, however, by issuing an API subscribe call (as opposed to
a join call). A subscriber is provided the ability to monitor a
particular process group, but not to participate in the group.

When a subscribe call is issued, it is forwarded to the Group
Services daemon on that processor, and that Group Services
daemon keeps track of it. If the Group Services daemon is not a
part of the processor group, then it will become inserted into
the group, as previously described. In one embodiment, there is
no voting for the subscriber, and other members of the group,
including the providers and any other subscribers, are not aware
of the subscriber. A subscriber cannot subscribe to a process
group that is not already created.

Group membership can also be altered by a group member leaving
or being removed from a group. In one example, a group member
wishing to leave a group sends a request to leave to the group
leader, in the manner described above, STEP 1300 "INITIATE
REQUEST TO LEAVE" (FIG. 13). The group leader sends a multicast
to the providers requesting the providers to vote on the
proposed change, STEP 1302 "VOTE." The vote takes place in the
manner described above, and if all of the providers vote
APPROVE, INQUIRY 1304, then the process is removed from the
membership list for that process group, STEP 1306 "REMOVE
PROCESS," and all of the group members are notified of the
change. However, if one of the providers votes REJECT, then the
process remains a part of the process group, the protocol is
terminated, and the providers are notified of the rejected
protocol. Of course, if none of the providers votes REJECT and
any one of the providers votes CONTINUE, then the protocol
continues to another round of voting.

A member of a group may leave the group involuntarily when it is
expelled from the group via an approved expel protocol proposed
by another process of the group, or when the group member fails
or the processor in which it is executing fails. The manner in
which an expulsion is performed is the same as that described
above for a member requesting to leave a group, except that the
request is not initiated by a process wishing to leave, but
instead by a process desiring to remove another process from the
group.

Likewise, in one embodiment, the technique for removing a
process when the process fails or when the processor executing
the process fails is similar to that technique used to remove a
process requesting to leave. However, instead of the process
initiating a request to leave, the request is initiated by Group
Services, as described below.

In the case of a process failure, in one example, the group
leader is informed of the failure by the Group Services daemon
running on the processor of the failed process. The Group
Services daemon determines that the process has failed when it
detects that a stream socket (known to those skilled in the art)
associated with the process has failed. The group leader then
initiates the removal.

In the case of a processor failure, the group leader detects
this failure and initiates the request to remove. If it is the
group leader that has failed, then group leader recovery is
performed, as described herein, before the request is initiated.
In one embodiment, the group leader is informed of the processor
failure by a subsystem that is distributed across the processing
nodes of the network. This subsystem sends out signals to all of
the processing nodes, and if the signal is not acknowledged by a
particular node, that node is considered down (or failed). This
information is then broadcast to Group Services.

As described above, when a process wishes to join a group or a
group member wishes to leave or is removed from the group, the
group leader informs each of the group providers of the proposed
change, so that the providers can vote on that change. In
accordance with the principles of the present invention, these
proposed membership changes can be presented to the group
providers either singly (i.e., one proposed group membership
change per protocol) or batched (i.e., multiple proposed group
membership changes per protocol). In the case of batched
requests, the group leader collects the requests for a
prespecified amount of time, as one example, and then presents
to the group providers one or more batched requests.
Specifically, one batched request is provided, which includes
all of the join requests collected during that time, and another
batched request is provided, which includes all of the leave or
remove requests collected. In one embodiment, one batched
request can only include all joins or all leaves (and removals),
and not a combination of both. This is only one example. In
other examples, it is possible to combine both types of
requests.

When a batched request is forwarded to the group providers, the
group providers vote on the entire batched request as a whole.
Thus, the entire batch is either accepted, continued or
rejected.

In accordance with the principles of the present invention, each
process group can determine whether it is willing to allow
requests to be batched or not. Additionally, each process group
can determine whether some types of requests are allowed to be
batched, while others are not. For instance, assume there are a
number of process groups executing in the network. Process Group
W can decide that it wants to receive batched requests for all
types of requests, while Process Group X can independently
decide that it wants to receive all requests serially.
Additionally, Process Group Y can allow batched requests for
only join requests, while Process Group Z allows batched
requests only for leave or removal requests. Thus, the
mechanisms of the present invention provide flexibility in how
requests are presented and voted on.
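
The per-group batching choice can be sketched as follows. The function present_requests and its two flags are hypothetical; the sketch assumes the embodiment in which joins and leaves (or removals) are never mixed in one batch, and it places leave and remove requests ahead of joins only as an illustrative ordering.

```python
def present_requests(requests, batch_joins, batch_leaves):
    """Turn the requests collected in one window into protocols to vote on.

    requests is a list of (kind, process) pairs, where kind is "join",
    "leave" or "remove". Each returned protocol is (kind, [processes]).
    """
    joins = [proc for kind, proc in requests if kind == "join"]
    leaves = [proc for kind, proc in requests if kind in ("leave", "remove")]
    protocols = []
    for kind, members, batched in (("leave", leaves, batch_leaves),
                                   ("join", joins, batch_joins)):
        if not members:
            continue
        if batched:
            protocols.append((kind, members))                 # one vote for the whole batch
        else:
            protocols.extend((kind, [m]) for m in members)    # one vote per request
    return protocols


collected = [("join", "p1"), ("leave", "p2"), ("join", "p3"), ("remove", "p4")]
group_w = present_requests(collected, batch_joins=True, batch_leaves=True)
group_y = present_requests(collected, batch_joins=True, batch_leaves=False)
```

Here group_w plays the role of a group that batches everything, and group_y of a group that batches only joins.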

Although the system is flexible, there are a number of rules
that have been instituted in one embodiment of the invention to
ensure consistent and reliable group membership. These rules
include the following, as one example:

1. No group member can be shown to be failing and leaving the
group before it has joined the group.

2. No group member can be shown to be joining a group a second
time, before its initial failure has been handled.

3. Where a group has both requests to join and has established
members in a failed state, all of the failed members are dealt
with (via one or more of the failure protocols) before any of
the requests to join can be satisfied.

4. All non-failed group providers, including those requesting to
join, see the same sequence of protocols and membership lists.

Described above in detail is how the voting protocol of the
present invention is used to manage group membership. The voting
protocol can also be used, however, to propose a group state
value, in accordance with the principles of the present
invention. In particular, during a voting phase, a provider of
the process group can propose to change the state value of the
group in addition to providing a vote value. This provides a
mechanism to allow group providers to reflect group information
reliably and consistently to other group members. In one
example, the group state value (and other information, such as a
message and an updated vote value, as described herein) is
provided with the vote value via a vote interface that allows
for various arguments to be presented.

For example, when a member joins or leaves the group, the group
is driven through a multi-step protocol, as described above.
During each voting step, the group members perform local actions
to prepare for the new member, or to recover from the loss of
the failed member. Based on the results of these local actions,
for instance, one or more of the providers may decide to modify
the group state value. In one example, the group state value can
be "active," indicating that the process group is ready to
accept service requests; "inactive," indicating that the process
group is shut down because, for instance, the group does not
have enough members; or "suspend," indicating that the process
group will accept requests, but is temporarily not processing
the requests.

Group Services guarantees that the updates to the group state
value are coordinated such that the group providers will see the
same consistent value. If the protocol is APPROVED, then the
latest updated proposed group state value is the new group state
value. If the protocol is REJECTED, then the group's state value
remains as it was before the rejected protocol began execution.
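
The commit rule for the group state value can be sketched as follows. The class GroupState and its methods propose and finish are hypothetical names for this illustration; the state strings are the example values given above.

```python
class GroupState:
    """Hypothetical holder for one process group's state value."""

    def __init__(self, value):
        self.value = value      # the committed group state value
        self._proposed = None   # latest value proposed during the running protocol

    def propose(self, value):
        # A later proposal in the same protocol replaces an earlier one.
        self._proposed = value

    def finish(self, approved):
        """End the protocol; commit the latest proposal only on approval."""
        if approved and self._proposed is not None:
            self.value = self._proposed
        self._proposed = None   # a rejected proposal leaves no trace
        return self.value


state = GroupState("inactive")
state.propose("suspend")
state.propose("active")
after_approve = state.finish(approved=True)

state.propose("inactive")
after_reject = state.finish(approved=False)
```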

In accordance with the principles of the present invention, the
voting protocol can also be used to multicast messages to the
group members. For example, in addition to providing a vote
value, a provider can include a message that is to be forwarded
to all other members of the process group. Unlike the group
state value, this message is not persistent. Once it is shown to
the group members, Group Services no longer keeps track of it.
However, Group Services does guarantee delivery to all
non-failed group providers.

The message can be used by a group provider, for instance, to
forward significant information during the protocol that cannot
be carried by the other responses within a vote. For example, it
can be used to provide information that cannot be reflected in
the provider's vote value or to provide information that does
not need to be made persistent. In one example, it can inform
the group members of a particular function to perform.

In accordance with one embodiment of the present
invention, each provider of a process group is expected to
vote at a voting phase of a protocol. Until all of the
providers vote, the protocol remains uncompleted. Thus, a
mechanism is provided in the voting protocol, in
accordance with the principles of the present invention, in
order to handle the situation in which one or more
providers have not provided a vote. In particular, the
voting mechanism includes a default vote value, which is
explained in detail below.

As examples, a default vote value is used when a provider
fails during the execution of the protocol, or when the
processor in which the provider is executing fails, or if the
provider becomes non-responsive, as described herein.
The default vote value guarantees forward progress for
the protocol and for the process group. A process group
initializes its default vote value when the group is first
formed by, for example, its attributes. In one embodiment,
the default vote value can either be APPROVE or REJECT.
During each voting phase, the default vote value can be
changed to reflect changing conditions within the group.

In the situation in which a process fails during the
protocol, Group Services determines this, as described
above, and thus, at any voting phase for the protocol, the
group leader will submit the group's current default vote
for the failed process. Similarly, if Group Services
determines that the processor executing a member
provider has failed, then the group leader once again
submits a default vote.

If, however, a processor or process is available but
non-responsive, then the default vote value can also be
used. In one example, a process is deemed non-responsive
when it does not respond to a vote within a time limit set
by the process group for that protocol. (Each protocol for
each process group can have its own time limit.) When the
process is non-responsive, the default vote value assigned
to the process group is used by the group leader for this
particular process. In one embodiment, it is possible to
have no time limit. In that situation, Group Services will
wait until the provider eventually responds or until it fails.
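
The way a group leader might fill in default votes for failed or non-responsive providers can be sketched as below. This is an illustrative sketch only: the function `collect_votes`, its parameters, and the choice to keep a vote that arrived before its provider failed are assumptions made for the example and are not taken from the description.

```python
from typing import Dict, Optional, Set

APPROVE, REJECT = "APPROVE", "REJECT"


def collect_votes(
    providers: Set[str],
    received: Dict[str, str],
    failed: Set[str],
    waited: float,
    time_limit: Optional[float],
    default_vote: str,
) -> Optional[Dict[str, str]]:
    """Return one vote per provider, or None if the phase must keep waiting.

    received   -- votes that have arrived so far, keyed by provider name
    failed     -- providers whose process or processor is known to have failed
    waited     -- seconds elapsed in this voting phase
    time_limit -- per-protocol limit; None means there is no time limit
    """
    timed_out = time_limit is not None and waited >= time_limit
    votes: Dict[str, str] = {}
    for p in providers:
        if p in received:
            votes[p] = received[p]   # assumption: an already-received vote stands
        elif p in failed:
            votes[p] = default_vote  # leader submits the default for a failed provider
        elif timed_out:
            votes[p] = default_vote  # leader submits the default for a non-responsive one
        else:
            return None              # a live provider has not voted yet: keep waiting
    return votes
```

With `time_limit=None` the function keeps returning `None` until every live provider has voted or has been marked as failed, which mirrors the no-time-limit embodiment.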

In one embodiment, when a default vote is used, the
providers are informed of this.

In accordance with the principles of the present invention,
a provider can dynamically update the default vote value
at any one or more of the voting steps within the protocol.
This allows flexibility in the handling of failures, as the
protocol progresses. The proposed default value is
submitted along with the vote value of the process. The
new default vote value remains in effect for the remainder
of the protocol, unless another default vote value is
proposed at a later voting step. If multiple default vote
values are proposed at a particular voting step, then in one
embodiment, Group Services (i.e., the group leader)
selects the value submitted by the first process to respond.
Once the protocol is complete, the default vote value for
the process group reverts back to the value initially set for
the group.
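
The lifetime of a dynamically updated default vote can be sketched as a small state holder. This is a sketch under assumptions: the class `DefaultVote` and its method names are hypothetical, and "first process to respond" is interpreted here as the first responder, in arrival order, that actually proposes a new default.

```python
from typing import List, Optional, Tuple


class DefaultVote:
    """Tracks a group's default vote across the voting steps of one protocol."""

    def __init__(self, group_default: str) -> None:
        self.group_default = group_default  # value set when the group was formed
        self.current = group_default        # value in effect for this protocol

    def end_voting_step(self, responses: List[Tuple[str, Optional[str]]]) -> None:
        """Apply the proposals made during one voting step.

        responses -- (provider, proposed_default or None), in arrival order.
        """
        for _provider, proposed in responses:
            if proposed is not None:
                self.current = proposed  # first proposal to arrive wins
                break                    # later proposals in this step are ignored

    def end_protocol(self) -> None:
        """Revert to the value initially set for the group."""
        self.current = self.group_default
```

A step with no proposals leaves `current` untouched, so a value chosen at an earlier step stays in effect for the remainder of the protocol.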

A default vote value is treated in the same manner as any
other vote value. However, default vote values cannot, in
one embodiment, include other information for the vote,
such as, for instance, a message, a group state value or a
new proposed updated default vote value.

As described above with reference to FIG. 11, all of the
above-described proposed protocols can be proposed as
one-phase protocols, in which the protocol is proposed and
accepted in one multicast. Therefore, it is not necessary to
take a vote.

Described in detail above are mechanisms for ensuring
highly-available multicomputer applications. As one
example, the mechanisms of the present invention can be
used for providing a fault-tolerant and highly-available
system. The mechanisms of the present invention
advantageously provide a general purpose facility for
coordinating, managing and monitoring changes to the
state of process groups executing within the system.

In accordance with the principles of the present invention,
membership within processor groups and process groups
can be dynamically updated. In both cases, processors or
processes can request to be added or removed from a
group. The mechanisms of the present invention ensure
that these changes are performed consistently and reliably.

Additionally, in accordance with the principles of the
present invention, mechanisms are provided for enabling
messages to be sent to one or more particular groups of
processors without having to send the messages to all of
the processor groups. Each processor group has the ability
to monitor and manage its own set of messages and to
determine if one or more messages have been missed. If a
message has been missed, that message is then retrieved
from another member of the group. There is no need to
maintain stable storage for these messages. Each member
of the group has the messages, and thus, can provide
missing messages to other members. This advantageously
reduces the costs of hardware.
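
Retrieving a missed message from another member, rather than from stable storage, can be sketched as follows. The sketch assumes that each group numbers its messages consecutively from 1 and that each member keeps them in an in-memory log; the sequence numbering and the names `missing_sequence_numbers` and `recover_from_peer` are assumptions made for the example.

```python
from typing import Dict, List


def missing_sequence_numbers(log: Dict[int, str], latest_seen: int) -> List[int]:
    """Return the sequence numbers in 1..latest_seen that this member never received."""
    return [n for n in range(1, latest_seen + 1) if n not in log]


def recover_from_peer(log: Dict[int, str], peer_log: Dict[int, str], latest_seen: int) -> List[int]:
    """Copy missed messages from a peer's in-memory log.

    Returns the sequence numbers that are still missing afterwards, so the
    caller can try another member of the group.
    """
    for n in missing_sequence_numbers(log, latest_seen):
        if n in peer_log:
            log[n] = peer_log[n]
    return missing_sequence_numbers(log, latest_seen)
```

For example, a member holding messages 1 and 3 detects that 2 is missing and copies it from any peer that holds it.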

Further, in accordance with the principles of the present
invention, mechanisms are provided for recovering from a
failed group leader. These mechanisms ensure that a new
group leader is selected easily and efficiently.
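
The claims of this document recite selecting, as the new group leader, the next processor in a membership list ordered by the sequence in which processors joined. A minimal sketch of that selection is shown below. The function name `next_group_leader` is hypothetical, and wrapping around to the start of the list when the failed leader was the last entry is an assumption, not something the text states.

```python
from typing import List, Optional


def next_group_leader(membership: List[str], failed_leader: str) -> Optional[str]:
    """Pick the member that follows the failed leader in join order.

    membership -- processor names, ordered by the sequence in which they joined
    Returns None when no other member remains.
    Raises ValueError if failed_leader is not in the membership list.
    """
    idx = membership.index(failed_leader)
    # Candidates after the failed leader first, then (assumed) wrap-around.
    candidates = membership[idx + 1:] + membership[:idx]
    return candidates[0] if candidates else None
```

Because every surviving member holds the same ordered list, each one can compute the same successor locally without an additional election round.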

The mechanisms of the present invention also provide an
application programming interface that unifies a number
of protocols into one single integrated framework for the
processes. As one example, the integrated application
programming interface provides a facility for
communicating between members of process groups, as
well as a facility for synchronizing processes of a process
group. Additionally, the same interface provides a facility
for dealing with membership changes to process groups,
as well as changes to group state values.

The application programming interface also includes a
mechanism that enables Group Services to monitor the
responsiveness of the processes. This can be performed in
a fashion similar to a ping mechanism used in computer
network communications.

In addition to the above, the mechanisms of the present
invention provide a dynamic barrier synchronization
technique. In accordance with the principles of the present
invention, the number of synchronization phases included
in any one protocol is variable, and can be determined by
the members voting on the protocol.
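
A protocol whose number of synchronization phases is decided by the voters can be sketched as a loop over voting phases. This sketch rests on assumptions that do not appear in this passage: a hypothetical `CONTINUE` vote value that asks for another phase, a rule that any `REJECT` ends the protocol as rejected, and a rule that the protocol is approved once no provider asks to continue. The name `run_protocol` and the `max_phases` guard are also illustrative.

```python
from typing import Callable, List, Tuple

APPROVE, REJECT, CONTINUE = "APPROVE", "REJECT", "CONTINUE"


def run_protocol(
    providers: List[str],
    vote: Callable[[str, int], str],
    max_phases: int = 100,
) -> Tuple[str, int]:
    """Run voting phases until the members stop asking for another one.

    vote(provider, phase) returns that provider's vote for the given phase.
    Returns (outcome, number_of_phases_used).
    """
    for phase in range(1, max_phases + 1):
        # Barrier: the phase completes only when every provider has voted.
        votes = [vote(p, phase) for p in providers]
        if REJECT in votes:
            return "REJECTED", phase
        if CONTINUE not in votes:
            return "APPROVED", phase
        # At least one provider asked for another synchronization phase.
    raise RuntimeError("protocol did not finish within max_phases")
```

The number of barriers is not fixed in advance: a provider that needs one more round of coordination simply votes to continue.
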

The mechanisms of the present invention can be included
in one or more computer program products including
computer useable media, in which the media include
computer readable program code means for providing and
facilitating the mechanisms of the present invention. The
products can be included as part of a computer system or
sold separately.

The flow diagrams depicted herein are just exemplary.
There may be many variations to these diagrams or the
steps described therein without departing from the spirit
of the invention. For instance, the steps may be performed
in a differing order, or steps may be added, deleted or
modified. All of these variations are considered a part of
the claimed invention.

Although preferred embodiments have been depicted and
described in detail herein, it will be apparent to those
skilled in the relevant art that various modifications,
additions, substitutions and the like can be made without
departing from the spirit of the invention and these are
therefore considered to be within the scope of the
invention as defined in the following claims.



Claims 

1. A method for joining a group of processors in a
distributed computing environment, said method
comprising the steps of:

requesting, by a processor, to join a group of
processors, said group of processors executing
related processes; and

adding said processor to said group of processors.

2. A method for maintaining groups of processors in a
distributed computing environment, said method
comprising the steps of:

identifying a specified action to be taken for a
group of processors of said distributed computing
environment, said group of processors including
one or more member processors, each of said one
or more member processors including a related
process; and

performing said specified action for said group of
processors.

3. The method of claim 2, wherein said specified action
is selected from the following list:

(a) insert, wherein a processor is requesting to
join said group of processors;

(b) multicast, wherein one of said one or more
member processors is requesting to forward a
message to any other member processors of
said group of processors;

(c) leave, wherein one of said one or more
member processors is requesting to leave said
group of processors;

(d) remove, wherein one member of said one
or more member processors is removed from
said group of processors, when said one member
fails; and

(e) maintaining a group leader for said group of
processors.

4. The method of any preceding claim, wherein a
group leader for said group of processors is maintained
and said method being capable of recovering from a
failed group leader by:

obtaining from a membership list ordered in
sequence of joins of processors to said group of
processors a next processor in said membership
list; and

selecting said next processor as a new group
leader of said group of processors.

5. A system for joining a group of processors in a
distributed computing environment, said system
comprising:

a processor programmable to request to join a
group of processors, said group of processors
executing related processes; and

means for adding said processor to said group
of processors.

6. A system for maintaining groups of processors in a
distributed computing environment, said system
comprising:

means for identifying a specified action to be
taken for a group of processors of said distributed
computing environment, said group of processors
including one or more member processors, each
of said one or more member processors including
a related process; and

means for performing said specified action for
said group of processors.

7. The system of claim 6, wherein said specified action
is selected from the following list:

(a) insert, wherein a processor is requesting to
join said group of processors;

(b) multicast, wherein one of said one or more
member processors is requesting to forward a
message to any other member processors of
said group of processors;

(c) leave, wherein one of said one or more
member processors is requesting to leave said
group of processors;

(d) remove, wherein one member of said one
or more member processors is removed from
said group of processors, when said one member
fails; and

(e) maintaining a group leader for said group of
processors.

8. The system of any preceding claim, wherein a
group leader for said group of processors is maintained,
and said system being capable of recovering from a
failed group leader by including the further elements:

means for obtaining from a membership list
ordered in sequence of joins of processors to said
group of processors a next processor in said
membership list; and

means for selecting said next processor as a
new group leader of said group of processors.

9. A computer program product stored on a computer
readable storage medium, containing software code
for performing the functions listed above in any
preceding method claim.